Drug Consumption (Quantified) Dataset

Introduction

Drug addiction has long been hypothesized to be associated with an individual’s psychology. Popular psychometric instruments such as the NEO-FFI-R (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness), the BIS-11 (impulsivity), and the ImpSS (sensation seeking) are designed to measure a person’s personality traits. In this dataset, these measurements are collected along with controls such as age, gender, education, country, and ethnicity, in order to model indicators of substance use for a range of drugs.

Several studies have examined the relationship between personality traits and substance abuse. For example, Roncero et al. noted a relationship between high neuroticism (N) and cocaine-induced drug consumption. Vollrath & Torgersen noted that a low conscientiousness (C) score and a high N or extraversion (E) score correlate strongly with hazardous health behaviors. Since substance abuse has long been considered a pressing economic problem, modeling the relationships between these covariates and the indicators of substance abuse could help policies target specific areas and maximize payoffs.

Data Overview [Source]

Rather than storing the categorical variables as factors, the dataset quantifies them: each category is replaced by a real number, so the variables appear as numeric and the distances between categories can be described in a metric space.

A summary of the variables and their respective codes is as follows:

Attributes Description
age age of participant
gender gender of participant
education level of education
country country of current residence of participant
ethnicity ethnicity of participant
Nscore NEO-FFI-R Neuroticism
Escore NEO-FFI-R Extraversion
Oscore NEO-FFI-R Openness to experience
Ascore NEO-FFI-R Agreeableness
Cscore NEO-FFI-R Conscientiousness
Impulsive impulsiveness measured by BIS-11
SS sensation seeking measured by ImpSS
alcohol class of alcohol consumption
amphet class of amphetamines consumption
amyl class of amyl nitrite consumption
benzo class of benzodiazepine consumption
caff class of caffeine consumption
cannabis class of cannabis consumption
choc class of chocolate consumption
coke class of cocaine consumption
crack class of crack consumption
ecstasy class of ecstasy consumption
heroin class of heroin consumption
ketamine class of ketamine consumption
legalh class of legal highs consumption
lsd class of LSD consumption
meth class of methadone consumption
mushroom class of magic mushrooms consumption
nicotine class of nicotine consumption
semer class of fictitious drug Semeron consumption
vsa class of volatile substance abuse consumption
  1. age
Value Meaning
-0.95197 18-24
-0.07854 25-34
0.49788 35-44
1.09449 45-54
1.82213 55-64
2.59171 65+
  2. gender
Value Meaning
0.48246 Female
-0.48246 Male
  3. education
Value Meaning
-2.43591 Left school before 16 years
-1.73790 Left school at 16 years
-1.43719 Left school at 17 years
-1.22751 Left school at 18 years
-0.61113 Some college or university, no certificate or degree
-0.05921 Professional certificate/diploma
0.45468 University degree
1.16365 Masters degree
1.98437 Doctorate degree
  4. country
Value Meaning
-0.09765 Australia
0.24923 Canada
-0.46841 New Zealand
-0.28519 Other
0.21128 Republic of Ireland
0.96082 UK
-0.57009 USA
  5. ethnicity
Value Meaning
-0.50212 Asian
-1.10702 Black
1.90725 Mixed-Black/Asian
0.12600 Mixed-White/Asian
-0.22166 Mixed-White/Black
0.11440 Other
-0.31685 White
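
The lookup tables above can also be applied in reverse to recover readable labels from the quantified values. The following is a minimal sketch and not part of the original analysis; the helper name `decode` and the tolerance `tol` are assumptions made for illustration, and only the age table is shown.

```r
# Named vector built from the age table above (label = quantified value)
age_codes = c("18-24" = -0.95197, "25-34" = -0.07854, "35-44" = 0.49788,
              "45-54" = 1.09449, "55-64" = 1.82213, "65+" = 2.59171)

# Hypothetical helper: map each quantified value to the label whose code
# lies within `tol` of it; values that match no code become NA
decode = function(x, codes, tol = 1e-4) {
  idx = vapply(x, function(v) {
    hit = which(abs(codes - v) <= tol)
    if (length(hit) == 1) unname(hit) else NA_integer_
  }, integer(1))
  factor(names(codes)[idx], levels = names(codes))
}

decoded = decode(c(0.49788, -0.07854, -0.95197), age_codes)
print(decoded)
```

The same helper could be reused for gender, education, country, and ethnicity by passing the corresponding named vector.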

The personality scores are measured on a “continuous” scale, so they are coded as continuous values, while the substance variables are coded as follows:

Value Meaning
CL0 Never Used
CL1 Used over a Decade Ago
CL2 Used in Last Decade
CL3 Used in Last Year
CL4 Used in Last Month
CL5 Used in Last Week
CL6 Used in Last Day
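
Because these classes are ordered by recency of use, they can be represented as an ordered factor. The snippet below is a small sketch under that assumption; the example vector `x` is made up for illustration and is not taken from the data.

```r
# Class codes and their meanings, in order of increasing recency
cl_levels = paste0("CL", 0:6)
cl_labels = c("Never Used", "Used over a Decade Ago", "Used in Last Decade",
              "Used in Last Year", "Used in Last Month", "Used in Last Week",
              "Used in Last Day")

# Illustrative input (hypothetical values)
x = c("CL5", "CL0", "CL2")

# Ordered factor: comparisons such as >= respect the recency ordering
usage = factor(x, levels = cl_levels, labels = cl_labels, ordered = TRUE)
recent = usage >= "Used in Last Year"
print(data.frame(code = x, usage = usage, recent = recent))
```

An ordered factor keeps the ordinal information that would be lost by treating the classes as unordered categories.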

Descriptive Data Analysis

We can probe the dataset as follows:

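# Reading the data from source; [-1] drops the first column (respondent ID).
# stringsAsFactors = TRUE keeps the class columns as factors in R >= 4.0.

drugs = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00373/drug_consumption.data", header = FALSE, stringsAsFactors = TRUE)[-1]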

# Naming the columns in order of the description on source.

colnames(drugs) = c(
  "age",
  "gender",
  "education",
  "country",
  "ethinicity",
  "Nscore",
  "Escore",
  "Oscore",
  "Ascore",
  "Cscore",
  "Impulsive",
  "SS",
  "alcohol",
  "amphet",
  "amyl",
  "benzo",
  "caff",
  "cannabis",
  "choc",
  "coke",
  "crack",
  "ecstasy",
  "heroin",
  "ketamine",
  "legalh",
  "lsd",
  "meth",
  "mushroom",
  "nicotine",
  "semer",
  "vsa"
  )

# Dimension of dataset

dim(drugs)
## [1] 1885   31
# First five observations of the dataset

head(drugs, 5)
##        age   gender education country ethinicity   Nscore   Escore
## 1  0.49788  0.48246  -0.05921 0.96082    0.12600  0.31287 -0.57545
## 2 -0.07854 -0.48246   1.98437 0.96082   -0.31685 -0.67825  1.93886
## 3  0.49788 -0.48246  -0.05921 0.96082   -0.31685 -0.46725  0.80523
## 4 -0.95197  0.48246   1.16365 0.96082   -0.31685 -0.14882 -0.80615
## 5  0.49788  0.48246   1.98437 0.96082   -0.31685  0.73545 -1.63340
##     Oscore   Ascore   Cscore Impulsive       SS alcohol amphet amyl benzo
## 1 -0.58331 -0.91699 -0.00665  -0.21712 -1.18084     CL5    CL2  CL0   CL2
## 2  1.43533  0.76096 -0.14277  -0.71126 -0.21575     CL5    CL2  CL2   CL0
## 3 -0.84732 -1.62090 -1.01450  -1.37983  0.40148     CL6    CL0  CL0   CL0
## 4 -0.01928  0.59042  0.58489  -1.37983 -1.18084     CL4    CL0  CL0   CL3
## 5 -0.45174 -0.30172  1.30612  -0.21712 -0.21575     CL4    CL1  CL1   CL0
##   caff cannabis choc coke crack ecstasy heroin ketamine legalh lsd meth
## 1  CL6      CL0  CL5  CL0   CL0     CL0    CL0      CL0    CL0 CL0  CL0
## 2  CL6      CL4  CL6  CL3   CL0     CL4    CL0      CL2    CL0 CL2  CL3
## 3  CL6      CL3  CL4  CL0   CL0     CL0    CL0      CL0    CL0 CL0  CL0
## 4  CL5      CL2  CL4  CL2   CL0     CL0    CL0      CL2    CL0 CL0  CL0
## 5  CL6      CL3  CL6  CL0   CL0     CL1    CL0      CL0    CL1 CL0  CL0
##   mushroom nicotine semer vsa
## 1      CL0      CL2   CL0 CL0
## 2      CL0      CL4   CL0 CL0
## 3      CL1      CL0   CL0 CL0
## 4      CL0      CL2   CL0 CL0
## 5      CL2      CL2   CL0 CL0
# Summary of the dataset

summary(drugs)
##       age               gender             education        
##  Min.   :-0.95197   Min.   :-0.4824600   Min.   :-2.435910  
##  1st Qu.:-0.95197   1st Qu.:-0.4824600   1st Qu.:-0.611130  
##  Median :-0.07854   Median :-0.4824600   Median :-0.059210  
##  Mean   : 0.03461   Mean   :-0.0002559   Mean   :-0.003806  
##  3rd Qu.: 0.49788   3rd Qu.: 0.4824600   3rd Qu.: 0.454680  
##  Max.   : 2.59171   Max.   : 0.4824600   Max.   : 1.984370  
##                                                             
##     country          ethinicity          Nscore         
##  Min.   :-0.5701   Min.   :-1.1070   Min.   :-3.464360  
##  1st Qu.:-0.5701   1st Qu.:-0.3169   1st Qu.:-0.678250  
##  Median : 0.9608   Median :-0.3169   Median : 0.042570  
##  Mean   : 0.3555   Mean   :-0.3096   Mean   : 0.000047  
##  3rd Qu.: 0.9608   3rd Qu.:-0.3169   3rd Qu.: 0.629670  
##  Max.   : 0.9608   Max.   : 1.9072   Max.   : 3.273930  
##                                                         
##      Escore              Oscore              Ascore         
##  Min.   :-3.273930   Min.   :-3.273930   Min.   :-3.464360  
##  1st Qu.:-0.695090   1st Qu.:-0.717270   1st Qu.:-0.606330  
##  Median : 0.003320   Median :-0.019280   Median :-0.017290  
##  Mean   :-0.000163   Mean   :-0.000534   Mean   :-0.000245  
##  3rd Qu.: 0.637790   3rd Qu.: 0.723300   3rd Qu.: 0.760960  
##  Max.   : 3.273930   Max.   : 2.901610   Max.   : 3.464360  
##                                                             
##      Cscore            Impulsive               SS            alcohol  
##  Min.   :-3.464360   Min.   :-2.555240   Min.   :-2.078480   CL0: 34  
##  1st Qu.:-0.652530   1st Qu.:-0.711260   1st Qu.:-0.525930   CL1: 34  
##  Median :-0.006650   Median :-0.217120   Median : 0.079870   CL2: 68  
##  Mean   :-0.000386   Mean   : 0.007216   Mean   :-0.003292   CL3:198  
##  3rd Qu.: 0.584890   3rd Qu.: 0.529750   3rd Qu.: 0.765400   CL4:287  
##  Max.   : 3.464360   Max.   : 2.901610   Max.   : 1.921730   CL5:759  
##                                                              CL6:505  
##  amphet     amyl      benzo       caff      cannabis   choc      coke     
##  CL0:976   CL0:1305   CL0:1000   CL0:  27   CL0:413   CL0: 32   CL0:1038  
##  CL1:230   CL1: 210   CL1: 116   CL1:  10   CL1:207   CL1:  3   CL1: 160  
##  CL2:243   CL2: 237   CL2: 234   CL2:  24   CL2:266   CL2: 10   CL2: 270  
##  CL3:198   CL3:  92   CL3: 236   CL3:  60   CL3:211   CL3: 54   CL3: 258  
##  CL4: 75   CL4:  24   CL4: 120   CL4: 106   CL4:140   CL4:296   CL4:  99  
##  CL5: 61   CL5:  14   CL5:  84   CL5: 273   CL5:185   CL5:683   CL5:  41  
##  CL6:102   CL6:   3   CL6:  95   CL6:1385   CL6:463   CL6:807   CL6:  19  
##  crack      ecstasy    heroin     ketamine   legalh      lsd      
##  CL0:1627   CL0:1021   CL0:1605   CL0:1490   CL0:1094   CL0:1069  
##  CL1:  67   CL1: 113   CL1:  68   CL1:  45   CL1:  29   CL1: 259  
##  CL2: 112   CL2: 234   CL2:  94   CL2: 142   CL2: 198   CL2: 177  
##  CL3:  59   CL3: 277   CL3:  65   CL3: 129   CL3: 323   CL3: 214  
##  CL4:   9   CL4: 156   CL4:  24   CL4:  42   CL4: 110   CL4:  97  
##  CL5:   9   CL5:  63   CL5:  16   CL5:  33   CL5:  64   CL5:  56  
##  CL6:   2   CL6:  21   CL6:  13   CL6:   4   CL6:  67   CL6:  13  
##   meth      mushroom  nicotine  semer       vsa      
##  CL0:1429   CL0:982   CL0:428   CL0:1877   CL0:1455  
##  CL1:  39   CL1:209   CL1:193   CL1:   2   CL1: 200  
##  CL2:  97   CL2:260   CL2:204   CL2:   3   CL2: 135  
##  CL3: 149   CL3:275   CL3:185   CL3:   2   CL3:  61  
##  CL4:  50   CL4:115   CL4:108   CL4:   1   CL4:  13  
##  CL5:  48   CL5: 40   CL5:157              CL5:  14  
##  CL6:  73   CL6:  4   CL6:610              CL6:   7
# Structure of the dataset

str(drugs)
## 'data.frame':    1885 obs. of  31 variables:
##  $ age       : num  0.4979 -0.0785 0.4979 -0.952 0.4979 ...
##  $ gender    : num  0.482 -0.482 -0.482 0.482 0.482 ...
##  $ education : num  -0.0592 1.9844 -0.0592 1.1637 1.9844 ...
##  $ country   : num  0.961 0.961 0.961 0.961 0.961 ...
##  $ ethinicity: num  0.126 -0.317 -0.317 -0.317 -0.317 ...
##  $ Nscore    : num  0.313 -0.678 -0.467 -0.149 0.735 ...
##  $ Escore    : num  -0.575 1.939 0.805 -0.806 -1.633 ...
##  $ Oscore    : num  -0.5833 1.4353 -0.8473 -0.0193 -0.4517 ...
##  $ Ascore    : num  -0.917 0.761 -1.621 0.59 -0.302 ...
##  $ Cscore    : num  -0.00665 -0.14277 -1.0145 0.58489 1.30612 ...
##  $ Impulsive : num  -0.217 -0.711 -1.38 -1.38 -0.217 ...
##  $ SS        : num  -1.181 -0.216 0.401 -1.181 -0.216 ...
##  $ alcohol   : Factor w/ 7 levels "CL0","CL1","CL2",..: 6 6 7 5 5 3 7 6 5 7 ...
##  $ amphet    : Factor w/ 7 levels "CL0","CL1","CL2",..: 3 3 1 1 2 1 1 1 1 2 ...
##  $ amyl      : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 3 1 1 2 1 1 1 1 1 ...
##  $ benzo     : Factor w/ 7 levels "CL0","CL1","CL2",..: 3 1 1 4 1 1 1 1 1 2 ...
##  $ caff      : Factor w/ 7 levels "CL0","CL1","CL2",..: 7 7 7 6 7 7 7 7 7 7 ...
##  $ cannabis  : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 5 4 3 4 1 2 1 1 2 ...
##  $ choc      : Factor w/ 7 levels "CL0","CL1","CL2",..: 6 7 5 5 7 5 6 5 7 7 ...
##  $ coke      : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 4 1 3 1 1 1 1 1 1 ...
##  $ crack     : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ecstasy   : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 5 1 1 2 1 1 1 1 1 ...
##  $ heroin    : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ketamine  : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 3 1 3 1 1 1 1 1 1 ...
##  $ legalh    : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 1 1 1 2 1 1 1 1 1 ...
##  $ lsd       : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 3 1 1 1 1 1 1 1 1 ...
##  $ meth      : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 4 1 1 1 1 1 1 1 1 ...
##  $ mushroom  : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 1 2 1 3 1 1 1 1 1 ...
##  $ nicotine  : Factor w/ 7 levels "CL0","CL1","CL2",..: 3 5 1 3 3 7 7 1 7 7 ...
##  $ semer     : Factor w/ 5 levels "CL0","CL1","CL2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ vsa       : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 1 1 1 1 1 1 1 1 1 ...

Now, after that brief introduction, we are ready to examine our data more closely.

Exploratory Data Analysis

# Install any packages that are missing, then load all of them
pkg_list = c("tidyverse", "corrplot", "gridExtra")
mia_pkgs = pkg_list[!(pkg_list %in% installed.packages()[,"Package"])]
if(length(mia_pkgs) > 0) install.packages(mia_pkgs)
loaded_pkgs = lapply(pkg_list, require, character.only=TRUE)

Correlation matrix:

# correlation between covariates

corr = cor(drugs[1:12]) %>% round(1)

corrplot(corr, method = "color", addCoef.col = "black", type = "upper")

None of the covariates are strongly correlated with one another, so severe pairwise collinearity appears unlikely.
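
Pairwise correlations alone cannot rule out multicollinearity, because one covariate may be close to a linear combination of several others. One common check is the variance inflation factor (VIF), which equals the diagonal of the inverse correlation matrix. The sketch below is not part of the original analysis; the function name `vif_from_cor` and the simulated data are assumptions used only for illustration.

```r
# Hypothetical helper: the VIF of each column is the corresponding diagonal
# entry of the inverse of the correlation matrix
vif_from_cor = function(X) {
  diag(solve(cor(X)))
}

# Simulated example: x5 is nearly the sum of four independent variables, so
# each pairwise correlation is moderate (about 0.5) but x5 is almost
# perfectly explained by the other columns
set.seed(42)
n = 500
base = matrix(rnorm(n * 4), nrow = n)
toy = cbind(base, rowSums(base) + rnorm(n, sd = 0.1))
colnames(toy) = paste0("x", 1:5)

max_pairwise = max(abs(cor(toy)[upper.tri(cor(toy))]))
vif = vif_from_cor(toy)
print(round(max_pairwise, 2))
print(round(vif, 1))
```

Assuming `drugs` has been loaded as above, the same check would be `vif_from_cor(drugs[1:12])`; values near 1 would support the conclusion that the covariates are not collinear.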

Next, we look at the correlation between the potential dependent variables and the independent variables. Based on the data description, we have 19 dependent variables (one per substance, including the fictitious Semeron) with 7 classes each (CL0 to CL6), except Semeron, for which only 5 classes are observed. Since correlations are hard to visualize with factor variables, we extract the digit from each class label to obtain an integer score from 0 to 6, and then use a correlogram as before. Note that the covariates themselves were categorical before being quantified.

y = lapply(drugs[13:31], as.character) %>% 
  lapply(., FUN = str_extract, pattern = "[:digit:]") %>% 
  lapply(., as.numeric) %>% 
  do.call(cbind, .)

corr2 = round(cor(drugs[1:12], y), 1)

corrplot(corr2, method = "color", addCoef.col = "black")
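
To pick out the strongest covariate-substance pairs without reading them off the plot, the correlation matrix can be reshaped and sorted by absolute value. This is a minimal sketch; the helper name `top_pairs` and the small matrix `toy_corr` are hypothetical and stand in for `corr2`.

```r
# Hypothetical helper: return the n pairs with the largest |correlation|
top_pairs = function(cm, n = 9) {
  df = as.data.frame(as.table(cm), stringsAsFactors = FALSE)
  names(df) = c("covariate", "substance", "r")
  df = df[order(-abs(df$r)), ]
  head(df, n)
}

# Made-up correlation values, for illustration only
toy_corr = matrix(c(0.1, -0.5, 0.3, 0.2), nrow = 2,
                  dimnames = list(c("a", "b"), c("x", "y")))

res = top_pairs(toy_corr, n = 2)
print(res)
```

With the real data, `top_pairs(corr2)` would list the nine strongest pairs, assuming `corr2` has been computed as above.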

Scatter plots:

Now we plot nine of the most strongly correlated covariate-substance pairs.

y.new = drugs[, c("cannabis", "lsd", "mushroom")]
x.new = drugs[, c("country", "Oscore", "SS")]

drug.new = cbind.data.frame(x.new, y.new)


# Plotting y variables against x variables (with geom_jitter to see density at each point)

count = 0

for(i in 1:ncol(x.new)){
  for(j in 1:ncol(y.new)){
    count = count+1
    assign(paste0("p", count),
           ggplot(drug.new, aes_string(x = names(x.new[i]), y = names(y.new[j]))) + 
             theme_bw() + 
             geom_jitter(color = "darkblue", alpha = 0.3) +
             labs(title = paste(names(y.new[j]), "vs.", names(x.new[i])))
           )
  }
}


# Printing the plots

mget(paste0("p", 1:9)) 
## [Figures p1 to p9: jittered scatter plots of cannabis, lsd and mushroom (y-axis) against country, Oscore and SS (x-axis)]

As we can see, the relationships look unusual because of the nature of the dataset: although they are correlated with substance use, some covariates (such as country) take only a few discrete values, and the substance classes are themselves discrete. For this reason, we need variable selection methods that perform well even when the variables are not continuous.

Density plots:

If we look at the density plot of each personality score, we find that each is approximately normally distributed.

# Extracting the personality scores

scores = drugs[6:12]

for(i in 1:ncol(scores)){
  assign( paste0("score", i),
          ggplot(drugs, aes_string(x = names(scores[i]))) +
            geom_density(color = "darkblue") +
            theme_bw() +
            scale_y_continuous(labels = function(x) paste0(x*100, "%")) +
            labs(title = names(scores[i]))
  )
}

# Plotting

grid.arrange(score1, score2, score3, score4, score5, score6, score7, 
             nrow = 2)

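If we further assume that the scores are normally distributed and mutually independent, then \(\sum_{i= 1}^{7} Z_i \sim N(\sum_{i=1}^{7} \mu_i, \sum_{i=1}^{7} \sigma_i^2)\), which could prove useful for hypothesis testing in parametric models or for constructing another normal variable. Independence is a strong assumption here, since any correlation between the scores would add covariance terms to the variance of the sum.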
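
The result for the sum can be checked by simulation. The sketch below uses simulated standard normal variables rather than the actual scores, so the sample size `n` and the seed are assumptions chosen for illustration. It also checks the general identity that the variance of a sum equals the sum of all entries of the covariance matrix, which holds whether or not the variables are independent.

```r
set.seed(123)
n = 1e5   # number of simulated respondents (assumption)
k = 7     # number of scores

# k independent N(0, 1) columns
Z = matrix(rnorm(n * k), nrow = n, ncol = k)
S = rowSums(Z)

# Under independence: mean close to 0 and variance close to k = 7
print(c(mean = mean(S), var = var(S)))

# General identity: Var(sum) = sum of all variances and covariances
print(all.equal(var(S), sum(cov(Z))))
```

For the real scores, comparing `var(rowSums(scores))` with `sum(apply(scores, 2, var))` would show how much the covariance terms matter.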

We can also see that the scores have already been approximately standardized: their means are close to 0 and their variances are close to 1.

# Mean
(mu = colMeans(scores))
##        Nscore        Escore        Oscore        Ascore        Cscore 
##  4.660477e-05 -1.628011e-04 -5.343979e-04 -2.449655e-04 -3.860690e-04 
##     Impulsive            SS 
##  7.216064e-03 -3.291666e-03
# Variance
(sigma2 = apply(scores, 2, var))
##    Nscore    Escore    Oscore    Ascore    Cscore Impulsive        SS 
## 0.9962155 0.9949035 0.9924713 0.9948875 0.9950514 0.9109452 0.9287199

Now, if we investigate the dependent variables, we can uncover a few interesting patterns.

# Changing from wide to long format

long = drugs %>% 
  gather(type, class, alcohol:vsa)

# Plotting frequencies by substance

ggplot(long, aes(x = class, y = ..count.., fill = type)) + 
  geom_bar() + 
  facet_wrap(~type) + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90))

Since CL0 means “Never Used” and CL1 through CL6 all mean that the substance has been used at least once, we can binarize the classes as follows:

# Code "CL0" as 0 and the rest as 1

substance.binary = apply(drugs[,13:31], 2, 
                         FUN = function(i) ifelse(i == "CL0", yes = "0", no = "1")) %>% 
  data.frame

# Changing from wide to long format

long2 = substance.binary %>% 
  gather(type, class, alcohol:vsa)

# Saving the binary code in a dataset

drugs.binary = drugs %>% .[-(13:31)] %>% cbind.data.frame(., substance.binary)

# Plotting

ggplot(long2, aes(x = class, y = ..count.., fill = type)) + 
  geom_bar() + 
  facet_wrap(~type) + 
  theme_bw()

We can see that almost all drugs have “takers”, except Semeron, for which nearly all respondents (1877 of 1885) are “never-takers”. This is expected: page 4 of the data description notes that Semeron is a fictitious drug included to identify over-claimers.
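
One possible use of the Semeron column, which the original analysis does not carry out, is to flag over-claimers and set them aside before modeling. The sketch below runs on a small made-up data frame; the column `id` and the object names are hypothetical.

```r
# Toy data for illustration only (not taken from the dataset)
toy = data.frame(
  id       = 1:4,
  semer    = c("CL0", "CL2", "CL0", "CL3"),
  cannabis = c("CL3", "CL6", "CL0", "CL5"),
  stringsAsFactors = TRUE
)

# Anyone reporting use of the fictitious drug is treated as an over-claimer
overclaimer = toy$semer != "CL0"

# Keep only respondents who did not claim Semeron use
clean = toy[!overclaimer, ]
print(sum(overclaimer))
print(clean)
```

On the real data, the summary above implies that 8 respondents (1885 - 1877) would be flagged in this way.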

Conclusion

This is a classification problem that we could approach with a range of machine learning techniques. Some objectives we could pursue are:

  • to identify the association of personality profiles with drug consumption.
  • to predict the risk of drug consumption for each individual according to their personality profiles.
  • to find a better binarization of each substance variable based on these relationships.
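
As one possible starting point for the second objective, a logistic regression can estimate an individual risk of use from personality scores. The following is a sketch on simulated data, not a result from the dataset: the coefficients used to generate `toy`, the sample size, and the choice of `Oscore` and `SS` as predictors are all assumptions made for illustration.

```r
set.seed(1)
n = 1000

# Simulated standardized scores and a binary "ever used" outcome
toy = data.frame(Oscore = rnorm(n), SS = rnorm(n))
p_true = plogis(-0.5 + 0.8 * toy$Oscore + 1.0 * toy$SS)
toy$cannabis = rbinom(n, size = 1, prob = p_true)

# Logistic regression of use on the two scores
fit = glm(cannabis ~ Oscore + SS, data = toy, family = binomial)
print(round(coef(fit), 2))

# Predicted probability of use for each individual
risk = predict(fit, type = "response")
print(summary(risk))
```

With the real data, the same call could be applied to `drugs.binary` after converting the chosen outcome column to a numeric 0/1 variable or a factor.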